“As I scurried across the candlelit chamber, manuscripts in hand, I thought I’d made it. Nothing would be able to hurt me anymore. Little did I know there was one last fright lurking around the corner.” Passages like this one, taken from horror stories, both terrify and excite us. In this report, we analyze a dataset of sentences from horror stories written by Edgar Allan Poe, Mary Shelley, and HP Lovecraft. The data was prepared by chunking larger texts into sentences using CoreNLP’s MaxEnt sentence tokenizer. Specifically, we would like to examine the similarities and differences between the texts attributed to each author and look for patterns that characterize the writing styles of the three authors.
packages.used=c("widyr","ggraph","igraph","stringr","scales","spacyr","cleanNLP","readr","stringi","ggplot2","corrplot","dplyr","tidyr","forcats","reshape2","ggridges","corrgram","textstem","tidytext","tm","topicmodels","wordcloud","RSentiment")
# check packages that need to be installed.
packages.needed=setdiff(packages.used,
                        intersect(installed.packages()[,1],
                                  packages.used))
# install additional packages
if(length(packages.needed)>0){
  install.packages(packages.needed, dependencies = TRUE)
}
library('widyr')
library('ggraph')
library('igraph')
library('stringr')
library('scales')
library('spacyr')
library('cleanNLP')
library('readr')
library('stringi')
library('ggplot2')
library('corrplot')
library('dplyr')
library('tidyr')
library('forcats')
library('reshape2')
library('ggridges')
library('corrgram')
library('textstem')
library('tidytext')
library('tm')
library('topicmodels')
library('wordcloud')
library("RSentiment")
# Models
source("../lib/multiplot.R")
First, before we dive into the analysis, let’s read in the data provided.
warnings('off')
## NULL
spookydata = read.csv('../data/spooky.csv', as.is = TRUE)
Then, we need to pre-process the text data.
spookydata <- spookydata %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "")
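To see what this filter keeps, here is a minimal base-R sketch that applies the same pattern to a few made-up strings. The example strings and the `keep` vector are ours, not part of the dataset, and `grepl()` with `perl = TRUE` stands in for `str_detect()`.

```r
# Made-up example strings (not taken from spooky.csv).
x <- c("Hello world.", ">quoted line", "", "...")

# Same pattern as in the filter above: from the start of the string, one or
# more characters other than ">", followed by a letter or a digit.
pattern <- "^[^>]+[A-Za-z\\d]"

# Keep a string if it matches the pattern or if it is empty.
keep <- grepl(pattern, x, perl = TRUE) | x == ""
print(data.frame(text = x, keep = keep))
```

Under these assumptions, a row is kept when it starts with at least one character other than `>` followed by a letter or digit, or when the text is empty.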
Next, let’s see whether sentence length varies much among the authors.
p <- spookydata %>%
  mutate(sen_len = str_length(text)) %>%
  ggplot(aes(sen_len, author, fill = author)) +
  geom_density_ridges() +
  scale_x_log10() +
  theme(legend.position = "right") +
  labs(x = "Sentence length")
plot(p)
## Picking joint bandwidth of 0.0414
It looks like the distribution of sentence length differs among the three authors. HP Lovecraft prefers longer sentences, with lengths concentrated around 200 characters.
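A numeric summary can back up what the ridge plot suggests. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in for `spookydata`) and base R only; on the real data one would presumably apply the same idea to the `text` and `author` columns.

```r
# Toy stand-in for spookydata; the sentences below are invented for illustration.
toy <- data.frame(
  author = c("EAP", "EAP", "HPL", "HPL", "MWS", "MWS"),
  text = c("The door creaked.",
           "I heard it again, closer now.",
           "Beyond the hills lay a nameless and ancient dread.",
           "No one spoke of it.",
           "My heart sank.",
           "She wept for all that was lost."),
  stringsAsFactors = FALSE
)

# Sentence length in characters, the same unit as sen_len in the plot.
toy$sen_len <- nchar(toy$text)

# Median sentence length per author.
med_len <- tapply(toy$sen_len, toy$author, median)
print(med_len)
```

Medians are used here because sentence lengths tend to be right-skewed, which is also why the plot uses a log scale.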
Second, let’s apply some simple treatment to the data: split the text into word tokens and remove uninformative tokens such as stop words. We can also lemmatize the words.
spooky_wrd <- lemmatize_words(spookydata) %>%
  unnest_tokens(word, text) %>%
  # remove stopwords
  anti_join(stop_words, by = "word") %>%
  count(author, word) %>%
  ungroup()
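Lemmatization maps each word form to its dictionary form, so it is usually applied to individual tokens rather than to whole sentences. As a rough illustration of the idea, the sketch below looks tokens up in a tiny hypothetical lemma table; `lemma_table` and `lemmatize_toy()` are made-up names, and a real lemma dictionary would be far larger.

```r
# Hypothetical toy lemma table: names are word forms, values are lemmas.
lemma_table <- c(appeared = "appear", doors = "door", lost = "lose")

# Replace every token found in the table by its lemma; leave the rest as is.
lemmatize_toy <- function(words) {
  hit <- words %in% names(lemma_table)
  words[hit] <- unname(lemma_table[words[hit]])
  words
}

tokens <- c("doors", "appeared", "suddenly", "lost")
lemmas <- lemmatize_toy(tokens)
print(lemmas)
```

In a tidytext pipeline, a word-level step like this would sit after tokenization, acting on the `word` column.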
In this part, let’s make word clouds to see the most common words used by the three authors, together and separately.
spooky_wrd_all <- spooky_wrd %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
wordcloud(spooky_wrd_all$word, spooky_wrd_all$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
These are the most common words in the dataset. We can see that the authors like to use words such as “life”, “death”, “door” and “light”. It seems that horror fiction likes to color everyday life with matters of life and death. Naturally, we would expect some differences between the authors. Now, let’s see whether there are any.
spooky_wrd_MWS <- spooky_wrd %>%
  filter(author == "MWS") %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
wordcloud(spooky_wrd_MWS$word, spooky_wrd_MWS$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
It seems that Mary Shelley focuses more on the human body and relationships.
spooky_wrd_EAP <- spooky_wrd %>%
  filter(author == "EAP") %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
# plot Poe's own word counts (not the MWS counts)
wordcloud(spooky_wrd_EAP$word, spooky_wrd_EAP$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
As for Edgar Allan Poe, he seems more concerned with time. Maybe he is the kind of writer who is good at creating a sense of urgency.
spooky_wrd_HPL <- spooky_wrd %>%
  filter(author == "HPL") %>%
  group_by(word) %>%
  summarise(n = sum(n)) %>%
  ungroup()
# plot Lovecraft's own word counts (not the MWS counts)
wordcloud(spooky_wrd_HPL$word, spooky_wrd_HPL$n,
          max.words = 200, scale = c(2.0,0.5),
          colors = RColorBrewer::brewer.pal(9, "YlOrRd")[4:9])
HP Lovecraft differs from the other two in that he prefers the night. Perhaps his stories mostly take place at night.
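Word clouds are easy to read but hard to compare, since raw counts also depend on how much text each author contributes. One possible complement is to convert counts into within-author shares. The sketch below does this in base R on made-up counts; `toy_counts` is a hypothetical table shaped like `spooky_wrd` (columns `author`, `word`, `n`), and the numbers are invented.

```r
# Hypothetical word counts shaped like spooky_wrd; the numbers are made up.
toy_counts <- data.frame(
  author = c("EAP", "EAP", "HPL", "HPL", "MWS", "MWS"),
  word   = c("time", "door", "night", "door", "heart", "time"),
  n      = c(30, 10, 45, 15, 40, 20),
  stringsAsFactors = FALSE
)

# Share of each word within its author's total count.
toy_counts$share <- toy_counts$n / ave(toy_counts$n, toy_counts$author, FUN = sum)

# Most frequent word per author.
top_word <- sapply(split(toy_counts, toy_counts$author),
                   function(d) d$word[which.max(d$n)])
print(toy_counts)
print(top_word)
```

With shares, a word that looks prominent in one cloud can be checked against the same word's share for the other authors.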
After exploring the words, let’s start the sentiment analysis and look at the emotions the authors express.
spooky_wrd <- lemmatize_words(spookydata) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
Let’s compare the sentiment words used by the different authors.
spooky_wrd %>%
  filter(author == "MWS") %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(8, n) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  labs(x = NULL, y = "Sentiment Analysis") +
  ggtitle("Mary Shelley: Negative Positive Words")
## Warning in mutate_impl(.data, dots): Unequal factor levels: coercing to
## character
## Warning in mutate_impl(.data, dots): binding character and factor vector,
## coercing into character vector
Mary Shelley’s sentiment words fall mostly into the negative, positive and uncertainty categories. Among the negative words, she favors “fear”, “lost” and “poor”. Her top three positive words are “happiness”, “happy” and “pleasure”, and her top three uncertainty words are “appeared”, “suddenly” and “unknown”.
spooky_wrd %>%
  filter(author == "HPL") %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(8, n) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  labs(x = NULL, y = "Sentiment Analysis") +
  ggtitle("H P Lovecraft: Negative Positive Words")
## Warning in mutate_impl(.data, dots): Unequal factor levels: coercing to
## character
## Warning in mutate_impl(.data, dots): binding character and factor vector,
## coercing into character vector
Similarly, HP Lovecraft is very much like Mary Shelley, with most of his sentiment words in the negative, positive and uncertainty categories. His top three negative words are “fear”, “lost” and “recall”; interestingly, he shares “fear” and “lost” with Mary Shelley. His top three positive words are “dream”, “leading” and “fantastic”, and his top three uncertainty words are “unknown”, “suddenly” and “appeared”.
spooky_wrd %>%
  filter(author == "EAP") %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup() %>%
  group_by(sentiment) %>%
  top_n(8, n) %>%
  mutate(word = reorder(word, n)) %>%
  ungroup() %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  coord_flip() +
  labs(x = NULL, y = "Sentiment Analysis") +
  ggtitle("E A Poe: Negative Positive Words")
## Warning in mutate_impl(.data, dots): Unequal factor levels: coercing to
## character
## Warning in mutate_impl(.data, dots): binding character and factor vector,
## coercing into character vector
Lastly, E A Poe’s sentiment words also fall mainly into the negative, positive and uncertainty categories. His top three negative words are “doubt”, “question” and “difficulty”, which differ from those of the previous two authors. His top three positive words are “beautiful”, “easily” and “excited”, and his top three uncertainty words are “doubt”, “appeared” and “suddenly”.
In all, the sentiment words of all three authors are concentrated in the “positive”, “negative” and “uncertainty” categories, and Mary Shelley and Lovecraft share some favorite negative words. E A Poe, however, uses quite different words from the other two authors.
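The comparison across authors can also be made with numbers rather than by reading three separate charts. The sketch below tallies sentiment categories per author in base R. It assumes a tiny made-up lexicon (`toy_lexicon`) and made-up tokens (`toy_tokens`) in place of the Loughran lexicon and `spooky_wrd`, and `merge()` plays the role of `inner_join()`.

```r
# Hypothetical mini-lexicon; the word-to-category pairs are for illustration only.
toy_lexicon <- data.frame(
  word = c("fear", "lost", "happy", "suddenly", "doubt"),
  sentiment = c("negative", "negative", "positive", "uncertainty", "uncertainty"),
  stringsAsFactors = FALSE
)

# Hypothetical tokens, one row per word occurrence.
toy_tokens <- data.frame(
  author = c("MWS", "MWS", "MWS", "HPL", "HPL", "EAP", "EAP"),
  word   = c("fear", "happy", "lost", "fear", "suddenly", "doubt", "door"),
  stringsAsFactors = FALSE
)

# Inner join on word: tokens missing from the lexicon ("door") drop out.
matched <- merge(toy_tokens, toy_lexicon, by = "word")

# Counts per author and category, then row-wise shares.
counts <- table(matched$author, matched$sentiment)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Row-wise shares put the authors on the same footing even when one of them contributes many more sentiment words than the others.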
Then, we can visualize the sentiment words as a comparison word cloud, as below.
spooky_wrd %>%
  inner_join(get_sentiments("loughran"), by = "word") %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"), max.words = 500)
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): henceforward could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): successful could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): guilty could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): neglect could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): panic could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): threatened could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): argument could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): arguments could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): barrier could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): dismal could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): harsh could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): inquiry could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): investigations could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): susceptible could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): suspended could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): believes could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): intangible could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): probabilities could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): uncertainty could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): variations could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): depressed could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): disappearance could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): drag could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): scrutiny could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): victims could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): abundance could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): achieved could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): creative could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): impressive could not be fit on page. It will not be plotted.
## Warning in comparison.cloud(., colors = c("#F8766D", "#00BFC4"), max.words
## = 500): succeeding could not be fit on page. It will not be plotted.
# The super negative index
To further analyze the sentiment across the stories, we can do some numerical analysis. Define the super negative index = (#uncertainty + #negative + #constraining) / (#positive + #negative + #constraining + #uncertainty), where #x is the number of words in a sentence that the Loughran lexicon tags with sentiment x.
pic1 <- spooky_wrd %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
ggplot(aes(author, fill = sentiment)) +
geom_bar(position = "fill")
pic2 <- spooky_wrd %>%
inner_join(get_sentiments("loughran"), by = "word") %>%
group_by(author, id, sentiment) %>%
count() %>%
spread(sentiment, n, fill = 0) %>%
group_by(author, id) %>%
summarise(neg = sum(negative),
con = sum(constraining),
unc = sum(uncertainty),
pos = sum(positive)) %>%
arrange(id) %>%
mutate(frac_neg = 1 - pos/(pos + neg + con+unc)) %>%
ggplot(aes(frac_neg, fill = author)) +
geom_density(bw = .3, alpha = 0.5) +
theme(legend.position = "right") +
labs(x = "Super negative index")
layout <- matrix(c(1,2),1,2,byrow=TRUE)
multiplot(pic1, pic2, layout=layout)
## Warning: Removed 273 rows containing non-finite values (stat_density).
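The warning above most likely comes from sentences whose only Loughran matches fall outside the four categories used in the index (the lexicon also has litigious and superfluous categories), so the denominator is zero. The following is a minimal base-R sketch of the formula with made-up counts; the helper name `super_negative_index` and the `toy` data frame are our own assumptions, not part of the analysis.

```r
# Hypothetical helper: the super negative index from per-sentence counts.
super_negative_index <- function(pos, neg, con, unc) {
  total <- pos + neg + con + unc
  1 - pos / total  # equals (neg + con + unc) / total; NaN when total is 0
}

# Made-up counts for three sentences (not taken from the data).
toy <- data.frame(
  pos = c(1, 0, 0),
  neg = c(2, 3, 0),
  con = c(1, 0, 0),
  unc = c(0, 1, 0)
)
toy$frac_neg <- with(toy, super_negative_index(pos, neg, con, unc))
print(toy)

# Rows with a non-finite index are the ones a density plot would drop.
n_dropped <- sum(!is.finite(toy$frac_neg))
n_dropped
```

The third toy sentence has no words in any of the four categories, so its index is 0/0 and it would be removed from the density plot.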
This figure shows the sentiment distribution for each author. All three authors use negative words most, followed by positive and then uncertainty words. The two male authors, Poe and Lovecraft, use uncertainty words more than Mary Shelley does.
So far we have analyzed statistics of single words. It is also interesting to look at n-grams, here pairs of adjacent words (bigrams).
As before, we lemmatize the text and remove the stop words.
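# lemmatize the text column; lemmatize_words() expects a vector of single words, not a data frame
usenet_bigrams <- spookydata %>%
mutate(text = lemmatize_strings(text)) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
# drop bigrams in which either word is a stop word
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ")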
usenet_bigram_counts <- usenet_bigrams %>%
count(author, bigram, sort = TRUE) %>%
ungroup() %>%
separate(bigram, c("word1", "word2"), sep = " ")
usenet_bigram_counts
bigram_tf_idf <- usenet_bigrams %>%
count(author, bigram) %>%
bind_tf_idf(bigram, author, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf %>%
group_by(author) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(bigram = reorder(bigram, tf_idf)) %>%
ggplot(aes(bigram, tf_idf, fill = author)) +
geom_col(show.legend = FALSE) +
facet_wrap(~author, scales = "free") +
ylab("tf-idf") +
coord_flip()
As we can see, bigrams such as “ha ha” and “heh heh” are among the most characteristic for Lovecraft and Edgar Allan Poe (EAP), while Mary Shelley hardly uses them. This may be a notable difference between the two male authors and the one female author in this data set.
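The tf-idf score behind this ranking can be reproduced by hand. The sketch below uses base R and made-up bigram counts, and assumes the usual definition: term frequency is a bigram's share of an author's bigrams, and inverse document frequency is the natural log of the number of authors divided by the number of authors who use the bigram.

```r
# Made-up bigram counts per author (not taken from the data).
counts <- data.frame(
  author = c("EAP", "EAP", "HPL", "HPL", "MWS", "MWS"),
  bigram = c("ha ha", "old man", "heh heh", "old man", "dear friend", "old man"),
  n      = c(3, 1, 2, 2, 4, 1),
  stringsAsFactors = FALSE
)

# Term frequency: share of an author's bigrams taken up by this bigram.
counts$tf <- counts$n / ave(counts$n, counts$author, FUN = sum)

# Inverse document frequency: log(number of authors / authors using the bigram).
# Each (author, bigram) pair appears once, so row counts equal author counts.
n_authors <- length(unique(counts$author))
authors_using <- as.numeric(table(counts$bigram)[counts$bigram])
counts$idf <- log(n_authors / authors_using)

counts$tf_idf <- counts$tf * counts$idf
counts[order(-counts$tf_idf), ]
```

A bigram that every author uses, such as “old man” in the toy data, gets an idf of zero and drops out of the ranking, which is why only author-specific bigrams reach the top.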
Writers sometimes use negation words such as “not” or “no”. How do these authors use them? A negation can reverse the contribution of the word that follows it in a sentiment analysis. Let's use the AFINN lexicon to look at this.
AFINN <- get_sentiments("afinn")
AFINN
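One way to examine negation is to look at scored words that directly follow a negation word. The base-R sketch below illustrates the idea under stated assumptions: three toy sentences stand in for the data, a tiny named vector `scores` stands in for the AFINN lexicon, and the list `negation_words` is our own choice.

```r
# Toy stand-ins (assumptions): three sentences and a tiny AFINN-like score table.
sentences <- c("i was not happy", "there was no hope left", "he was happy")
scores <- c(happy = 3, hope = 2, fear = -2)
negation_words <- c("not", "no", "never", "without")

# Split each sentence into words, then pair every word with its successor.
words <- strsplit(tolower(sentences), "[^a-z']+")
bigrams <- do.call(rbind, lapply(words, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(word1 = head(w, -1), word2 = tail(w, -1), stringsAsFactors = FALSE)
}))

# Keep bigrams whose first word is a negation and whose second word has a score.
keep <- bigrams$word1 %in% negation_words & bigrams$word2 %in% names(scores)
negated <- bigrams[keep, ]
negated$value <- unname(scores[negated$word2])
negated
```

Each remaining row is a word whose score a plain word-by-word sum would count with the wrong sign.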
Next, let’s explore the structure of the texts using a Latent Dirichlet Allocation (LDA) model. This method yields an unsupervised classification of documents. By looking for clusters of words that correspond to different topics, we can find underlying structure in the data.
# divide into documents; with the chapter split commented out, each author's text is one document
by_chapter <- spookydata %>%
group_by(author) %>%
#mutate(chapter = cumsum(str_detect(text, regex("^chapter ", ignore_case = TRUE)))) %>%
ungroup() #%>%
#filter(chapter > 0) %>%
#unite(document, author, text)
# split into words
by_chapter_word <- by_chapter %>%
unnest_tokens(word, text)
# find document-word counts
word_counts <- by_chapter_word %>%
anti_join(stop_words) %>%
count(author, word, sort = TRUE) %>%
ungroup()
## Joining, by = "word"
word_counts
chapters_dtm <- word_counts %>%
cast_dtm(author, word, n)
chapters_dtm
## <<DocumentTermMatrix (documents: 3, terms: 24949)>>
## Non-/sparse entries: 40182/34665
## Sparsity : 46%
## Maximal term length: 19
## Weighting : term frequency (tf)
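To make the printed summary concrete, the sketch below builds a small document-term matrix in base R from made-up counts and computes its sparsity, the share of zero cells. The data frame `word_counts_toy` is hypothetical; the last line applies the same ratio to the entry counts printed above.

```r
# Made-up author-word counts (not taken from the data).
word_counts_toy <- data.frame(
  author = c("EAP", "EAP", "HPL", "MWS", "MWS"),
  word   = c("raven", "night", "night", "night", "love"),
  n      = c(2, 1, 3, 1, 4)
)

# One row per author, one column per word, cells hold the counts.
dtm <- xtabs(n ~ author + word, data = word_counts_toy)
dtm

# Sparsity: fraction of cells that are zero.
sparsity <- sum(dtm == 0) / length(dtm)
sparsity

# Same ratio from the non-sparse and sparse entry counts reported above.
34665 / (40182 + 34665)
```

With only three documents (one per author), every topic estimate rests on very few rows, which is worth keeping in mind when reading the topic plots.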
I fit an LDA model with 10 candidate topics and save it to the output folder.
k <- 10
chapters_lda <- LDA(chapters_dtm, k = k, method = "Gibbs", control = list(seed = 1234))
chapters_lda
## A LDA_Gibbs topic model with 10 topics.
chapter_topics <- tidy(chapters_lda, matrix = "beta")
chapter_topics
top_terms <- chapter_topics %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
We can now see the ten topics and the five words with the highest probability (beta) in each topic.
chapters_gamma <- tidy(chapters_lda, matrix = "gamma")
chapters_gamma
chapters_gamma <- chapters_gamma %>%
separate(document, c("author"), sep = "_", convert = TRUE)
chapters_gamma
chapters_gamma %>%
mutate(author = reorder(author, gamma * topic)) %>%
ggplot(aes(factor(topic), gamma)) +
geom_boxplot() +
facet_wrap(~ author)
In topic 1, which describes atmosphere and environment, Lovecraft has a clearly higher weight than the other two authors, while Edgar Allan Poe concentrates on topic 2 and Mary Shelley on topic 9. These differences in topic weight are the clearest contrast among the authors in this plot.
chapter_classifications <- chapters_gamma %>%
group_by(author) %>%
top_n(1, gamma) %>%
ungroup()
chapter_classifications
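The classification step above keeps, for each author, the topic with the largest gamma. A minimal base-R sketch of the same idea on made-up gamma values follows; the data frame `gamma_long` is hypothetical and only mimics the long format with one row per author and topic.

```r
# Made-up gamma values in long format (not taken from the fitted model).
gamma_long <- data.frame(
  author = rep(c("EAP", "HPL", "MWS"), each = 3),
  topic  = rep(1:3, times = 3),
  gamma  = c(0.2, 0.7, 0.1,  0.6, 0.3, 0.1,  0.1, 0.2, 0.7)
)

# For each author keep the row with the largest gamma.
dominant <- do.call(rbind, lapply(split(gamma_long, gamma_long$author), function(d) {
  d[which.max(d$gamma), ]
}))
dominant
```

Note that `which.max()` returns only the first maximum, whereas `top_n(1, gamma)` keeps all tied rows, so the two can differ when an author has two equally weighted topics.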
Overall, the three authors differ in writing style in several respects, including sentence length, favored topics and word choice, and some of these differences separate the two male authors from the female author. However, the three authors are much alike in how the emotions of their words are distributed. This may be viewed as a pattern of horror fiction.